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IHTHORUCTIOH 

The growth of data warehousing has been led 
not by marketing hype but by the real value of 
the information that organization* everywhere 
! ate deriving From their previously untapped 

j data. I 
I Organizations implementing data warehouses 

; find that the methods and tools that served 
| them so well with operational systems often fail 
';■ short of the demands of the data warehouse, 
i This is particularly true of traditional relational 
! database management systems (RDBMSs). 

i Systems designed for on-line transaction pro- 

i tossing (OLTP) do not easily address the j 
I requirements of data warehousing. In particular, t 
I the data models and schcmas that have been so 
' successful for OLTF systems are not well-suited, 
to ad hoc analysis of complex data. 

Ongoing experience in data warehousing heJps 
identify better schemas for complex queries. 
The star schema, in particular, has gained wide- 
Spread acceptance for the intuitive nature of its 
| design. Unfortunately, traditional OLTp systems \ 
■ aie not designed for the multidimensional ! 
j nature of the star schema and as a result pro- \ 
vide poor performance on complex queries to ; 
this schema. This paper describes the causes ! 
and nature of these performance problems, and j 
the family of technologies that Red Brick has j 
implemented to address these issues in the data \ 
warehouse environment. \ 

I | 

j - t 

i 



After many years of experience, businesses in 
all industries have become adept at implement- 
ing efficient operational systems such as payroll, 
inventory tracking, and purchasing. Businesses 
have long known how to define RDBMS 
schemas that allow operational systems to oper- 
ate at peak efficiency, expediting a large number 
of small, but simultaneous read /write requests. 
Schema definition often focuses on defining 
tables that map very efficiently to operational 
requests while minimizing contention for access 
to individual records. In other words, opera- 
tional systems maximize concurrency and opti- 
mize insert/update/dclete performance. 
The demands placed on the database by a data 
warehouse are very different A data warehouse 
typically needs io process queries that are large, 
complex, ad hoc and data-intensive. Not only 
are there significant technological differences in 
how these systems consume computing 
resources, but the nature of what is being done 
requires a fundamentally different approach to 
defining the database schema. 
These differences are best illustrated by a simple 
example. Figure 1 shows a typical OLTP schema 
for an order entry system. This schema works 
quite well for an operational system that can 
efficiently create, update, and process individ- 
ual purchase orders (POs); however, it is not 
well suited for analyzing purchase patterns. For 
example, an analyst structuring a query to list 
company name, cost of goods, source location 
and destination would have to spend some time 
figuring out how to navigate the various tables; 
the process would not be intuitive. 

Data warehouses require a new query-centric 

view Of the data; star schemas provide such a 
view. The basic premise of star schemas is that 
information can be classified into two groups: 
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Figure 1. Typical OLTP Schema 
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Figure 2. Star Schcr 
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(acts and dimensions. Facts are the cote data 
element being analyzed. In our example, pur- 
chases of individual items are facts, while in a 
point-of-sale (FOS) database sales are facts. 
Dimensions arc attributes about the facts. In our 
example, these are the item purchased and the 
date purchased, or in a POS database, the dale 
of sale and the item sold. Most, if not all, analy- 
sis is based on these dimensions and hence Uie 
term dimensional analysis. 
Given this two-way classification of informa- 
tion, Figure 2 shows an alternative schema for a 
purchase order database. Asking the same busi- 
ness question against this schema is much more 
straightforward because we are looking up spe- 
cific facts ( PURCHASES ) through a Set of 
dimensions (SHiP_rRon, SHIPJTO, ITEM), 
It's also significant that, in the typical star 
schema, the fact table is much larger than any of 
its dimension tables. This point becomes impor- 
tant as we consider the performance Issues asso- 
ciated with star schemas later in this paper, 



The intuitive structure of a star schema makes it 
better suited to the data warehouse environ- 
ment than traditional OLTP schemas, A data 
warehouse must respond to a barrage of ad hoc 



queries. Its success depends, in part, on the 
queries being easy to formulate— analysts can- 
not spend hours navigating a mast of interrelat- 
ed tables. The star schoma is rapidly gaining 
acceptance at- the schema of choice for data 
warehouse applications because of its intuitive 
approach to presenting data. 
The purchasing example is relatively simple: 
other real world applications are more complex, 
requiring multiple linked fact tables which 
share common dimension tables. Extended 
through multiple fact tables, the star schema can 
handle the most complex environments, such as 
those in the health care or telecommunications 
industries. 



Because mDSt experts agree that Star schemas 
are the preferred modeling method fardata 
warehousing, you might think that the database 
issues are more Dr less resolved. Unfortunately, 
this is not the case. As implementors have tried 
to host Intuitive query-centric schemas on tradi- 
tional OLTF databases, they have encountered a 
major obstacle: poor performance. 

Pah-Wise Join Prosum 
Traditional OLTF database engines are not 
designed for the rich set of complex queries that 
can be issued against a star schema. In particu- 
lar, the need to retrieve related information from 
several tables in a single query— "join process- 
ing"— is severely limited. Traditional OLTP 
databases typically join two tables at a time. If a 
complex join involves more than two tables, the 
RDBMS needs to artificially break the query into 
a series of pair-wise joins. Pair-wise joining is 
suitable for the simple requests of an OLTP sys- 
tem, but cannot perform adequately in a data 
warehouse environment. 
The limitation of the pair-wise join is best illus- 
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bated in an example. Using our example star 
schema (Figure 2), if you wanted to list the com- 
pany name, purchase price, source location, des- 
tination location, and purchase date, you would 
need to join data from six tables: COMPANY (for 
the company name), PURCHASES (for the pur- 
chase prkc), SHlP_FROM (for the source city), 
3HXP_TO (lor the destination city), ITEM (for 
the item in question), and DATE (for the pur- 
chase date). A traditional OLTp database would 
have to select two tables to Join initially, such as 
company and PURCHASES. These two tables 
would be joined and generate an intermediate 
result consisting of COMPANY joined to PUR- 
CHASES. This intermediate join result would he 
joined with another table, perhaps 5HIP_from, 
to produce another Intermediate join result. This 
process would continue, generating four inter- 
mediate results before creating the full result 
(see Figure 3). These Intermediate results can be 
large and very costly to create. 



Figure 3. Pair-Wise Joining 
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The order in which the joins are done dramati- 
cally affects query performance. As an extreme 
example, if the join of ITEM to purchases 
results in the selection of only a single record, 
subsequent joins would only be joining this one 
record. However, if PURCHASES is joined to 
COMPANY first, an Intermediate result might 
contain every single row In purchases. 
Because selecting the order of the pair-wise 
joins can have such a dramatic performance 
impact, traditional OLTP databases waste a 
great deal of effort on finding the best order in 
which to execute those joins. 
Beca use the nu mber of combina tions to be eva I- 
uated grows exponentially with the number of 
tables being joined, the problem di selecting the 
best order of pair-wise joins rarely can be solved 
m a reasonable amount of time. The number of 
ways to pair-wise join a set of N tables is Nl.» 
For example, a six-table query has 
6! = (6x5x4x3x2x1) * 720 combinations. For a 
seven-table query, there would be 
(7x6x5x4x3x2x1) s= 5,040 combinations; for a ten- 
table query there would be 3,628,800 combina- 
tions, and so on. 

Even these numbers are misleading because 
when evaluating a join between two tables, the 
RDDMS may have many different join algo- 
rithms with which the tables could be Joined, 
Each of these algorithms may need to be evalu- 
ated for every combination. For example, if 
there are five possible join algorithms, the data- 
base may need 10 evaluate 101x5 = -13 million 
combinations for a ten-table query. This prob- 
lem is so serious that some databases— for 
example, DBZ on an IBM mainframe — will actu- 
ally refuse to run a query that tries to join too 
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many tables. With nil traditional OLTP databas- 
es, it's also important to remember that the task 
of deciding the pair-wise join Order must be 
completed before the query even begins to exe- 
cute. Any time the database spends evaluating 
pair-wise join combinations Is time not spenl 
running the query and answering the business 
question. 

Sta» Schema Join Feodum 

Because the number of pair-wise join combina- 
tions is often too large to evaluate fully, tradi- 
tional OLTP databases pick an "interesting" 
subset for evaluation. How these subsets ate 
selected differs from one database to the next, 
but in general the system starts by picking com- 
binations of tables that are directly related. 
Using our example star schema, joining ITEM 
and PURCHASES would be interesting but join- 
ing item and COMPANY would not, This strate- 
gy works reasonably well with traditional OLTP 
SChemas that contain a rich network of interre- 
lated tables, but unfortunately fails miserably 
for star schemas. 

Looking only at directly related tables doesn't 
work for data warehouses because. In a typical 
star schema, the only table directly related to 
mast other tables is the fact table, This means 
that the fact table is a natural candidate for the 
first pair-wise join. Unfortunately, the fact table 

is typically the very largest table in thu query, so 

this strategy invariably leads to selecting a pair- 
wise join order that generates a very large inter- 
mediate result set. Generating large intermedi- 
ate result sets severely affects query perfor- 
mance. Furthermore, these large intermediate 
sets can fill up available space, stopping further 
query execution. 

Clearly, these problems only compound in a 
complex star schema situation, where you may 
query multiple linked fact tables, and have even 



larger intermediate results. 

Limitations in Optimization Uchniques rot 
Complex Queries 

Traditional cost-based optimizers don't deal 
well with complex queries. The basic premise of 
these optimizers is that you can estimate the 
cost of a query before executing it. 
Unfortunately, for complex queries in data 
warehouse environments, this premise does not 
always hold true. 

Traditional OLTP databases generate cost esti- 
mates based on statistics gathered from the data 
itself. These statistics can only supply crude 
estimates for the complex queries that are typi- 
cal of the data warehousing environment. Worse 
yet, a poor estimation early in the process can 
result In a much larger deviation, in effect mag- 
nifying the initial error as the query proceeds, 
Take as an example the problem of data skew; 
data that is not uniformly distributed among its 
various dimensions. Cost-based optimizers typi- 
cally generate cost estimates based on (he size 
of intermediate results. These decisions can be 
led astray by dependencies between the data 
values. For example, an optimizer that knows 
the percentage of customers in California and 
the percentage of customers purchasing con- 
vertibles will operate as if the variables are 
independent, when in fact, California may 
account for a higher percentage of convertibles, 
relative to its size, than North Dakota oi the 
state of Washington, 



The net effect is that even if the ti 
OLTP database happens to stumble onto an 
optimal strategy, it may dismiss it as too costly 

because of errors in the cost estimation process- 
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The performance problems described here have 
not gone unnoticed by the traditional OLTP 
database providers who are beginning to offer 
"solutions" to these critical problems. 
Unfortunately for the data warehouse Irnple- 
mentflr, such solutions are constrained by the 
very infrastructures that have- made these 
RDBMSs ao successful in OLTP applications. 
Nonetheless, it is important to understand these 
pseudo-solutions and their limitations, 

Picking Untt Join Paibs 

A common optimization that provides some 
relief for Ihe star schema join problem is evalu- 
ating more combinations of pair-wise join order- 
ings. The basic idea is to try to get around the 
pair-wise join strategy of only selecting related 
tables. 

To understand how this works, you first need to 
understand what It means to join unrelated 
tables, When two tables are joined and no 
columns "link" the Cables, every combination of 
Hie two tables" rows are produced, in RDBMS- 
speak, this is called a "Cartesian product." For 
example, if the item table had twu rows 
("ps P er r " "tape") arid the company, table had 
three rows ("Sears," "KMart," "Walmart") the 
Cartesian product would contain six rows: 
"paper" +■ "Sears," "tape"+ "Sears," "paper'V 
"KMart," "tape"+ "KMart," "paper"* 
"Walmart," and "tape"+ "Walmart," Normally, 
the RDBMS would never consider Cartesian 
products as reasonable pair-wise join candi- 
dates, but for star schemas, there are times 
when these Cartesian products improve query 
performance. 

Generating Cartesian products enn sometimes 
improve query performance because the fact 
table in a star schema is typically much larger 
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than its dimension tables, Remember that one of 
the problems with selecting related tables to join 
is that the fact table Is chosen very early on for a 
pair-wise join. This can be a very bad choice 
because it may generate a large intermediate 
result due to the sheer si2e of the fact table. 
Alternatively, if a Cartesian product of all 
dimension tables is first generated (by succes- 
sive pair-wise joins), the join to the fact table 
can be deferred until the very end. The key ben- 
efit is that the large fact table does not find its 
way into any of the intermediate join results. 
The key cost is the need to generate the 
Cartesian product of the dimension tables. As 
long as the cost of generating Die Cartesian 
product is less than the cost of generating inter- 
mediate results with the fact table, this opti- 
mization has some benefit. 

This simple optimization trick clearly isn't o 
panacea. Remember that although the query 
against the star schema is naturally a multi-table 
join, the RDBMS la still artificially breaking it 
down into a set of pair-wise joins. Worse yet, 
this strategy Is only viable if the Cartesian prod- 
uct of dimension rows selected Is much smaller 
than the number of rows in the /act table. To 
illustrate using our example schema, if there 
were 10,0DD ITEM rows, 500 ski £>_FRQM tows, 
500 3HIP_TQ rows, 3,000 GATE rows, and 1,000 
COMPANY rows, the final intermediate result size 
could be 7,500 trillion rows! 2 Using this strate- 
gy, processing would probably never complete, 
and the RDBMS would have to use the more 
traditional pair-wise join ordering. The multi- 
plicative nature of the Cartesian join makes the 
optimization helpful only for relatively small 
problems. This last point comes close to the 
heart of the matter: "small" and "data ware-'' 
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house" are not synonymous. 

Improving Siak Schema Performance with 
Paraluusm 

Many traditional OLTP database vendors tout 
parallelism as the answer to performance prob- 
lems for data warehousing. On the surface, par- 
allelism appears particularly attractive in light 
of the ever-growing number of high-perfor- 
mance symmetric-multiprocessing (SMP) and 
massively-parallel processing (MPF) systems on 
the market. Unfortunately /or business users, 
parallelism— although an Important component 
of a viable data warehousing RDBMS — is not 
sufficient to overcome the limitations of an 
RDBMS designed for OLTP. 

| Through parallelism, users can either reduce the 

I execution time of a single query (speed-up) or 

i handle additional work without degrading exe- 

. cutlon time (scale-up), speed-up is achieved by 

| applying more computing resource 

t (CPUs/disks) to the query while not increasing 

j the amount of data die query needs to process. 

1 Scale-up is achieved by applying enough addi- 

: tlonal computing resource to the query to com- 

i pensate for the additional data the query will 

i need to process. Most databases have parallel 

query capability of one form or another and are 

j able to achieve these objectives Co varying 

t Unfortunately, these objectives completely 

t ignore the basic cost-performance criterion cen- 
tral to commercial information systems. When 

\ applied to traditional brute force join strategies, 

j parallelism tries to solve the performance prob- 

j lem by throwing enough dollars (computing 

1 resources) a t uie problem to make execution 

i time bearable. Tins is best illustrated by on 

; example: 

i Assume that the best pair-wise algorithm 

< processes a certain multi-table query in 500 sec- 



onds. If we assume that this algorithm is per- 
j fectly parallellzable, a ten-processor SMP sys- 
! tern would process the query in SO seconds. Not 
| bad, although there are other alternatives, For 
j example, suppose there is a "super algorithm" 
I that is ten times more efficient at processing 
multi-table joins than the pair-wise join algo- 
] rithm. This super algorithm would process the | 
same query in 50 seconds on a single processor. j 
The parallelized super algorithm on the ten- j 
processor system would process the query In an j 
amazing five sccondsl 

Clearly, the super algorithm wins. You get the j 
same 50-second performance time as with the ! 
ten-processor system, but you save the cost of j 
the other nine processors or you use the proces- | 
sors to run Other queries. j 

In fact this analysis paints too rosy a picture for i 

traditional databases: traditional systems don't t 

scale perfectly with additional processors. A II j 

current traditional OLTP databases have over- j 

■ head associated with parallelism that results in j 
less than linear scale-up. For example, going j 

I from 10 CPUs to 20 CPUs may only reduce the | 

• execution time of a 60-minute query to 45 min- 

i utes (a 25 percent speedup instead of the 50 per- t 

cent you might expect). As more CPUs are j 

added, the problem usually gets worse and, in I 

the extreme, adding more CPUs may actually ' j 

stow down the query. j 

I Furthermore, OLTP parallel implementations da > 

not hove a strategy for applying the right 1 

■ degree of parallelism automatically for queries: ■ 
j parallelism on demand, as It were. For example, j 

• one query may benefit from a small degree of \ 
\ parallelism, while adding extra processors will j 
j actually waste resources without improving j 
> performance. Other queries may be able to use 

almost all of the processors you can throw ot ; 
; them. These OLTP systems leave it up to the 
! database user to manually tune each query by ' 
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setting the appropriate degree of parallelism on 
a query- by-query basis. Clearly this approach 
depends heavily on the human expertise and 
intervention. 

Finally, parallelism really shouldn't be consid- 
ered in isolation from the rest of the system's 
activities. At times when the system is relatively 
unused, you might choose to use more paral- 
lelism because you'll have less Impact on other 
running queries and processes. At other, heavi- 
ly-used times, a query that U3e3 a high degree of | 
parallelism can shut down performance for j 
other users of the system. Parallelism in a pro- , 
duction environment should analyze the larger [ 
issue of the system's resources as part oF the { 
determination of how much parallelism to use. j 

BIT VECTOR "STAR™ JOINS 

Another technique some databases use to speed 
join processing is the bit vector "star" Join. This 
join technique speeds join processing by pro- 
cessing multiple bit vector indexes. Some of the 
traditional database vendors are starting to add 
this kind of bitmap indexing and processing to i 
provide performance gains. 

A traditional B-tree Index maintains a list of j 
iow-lDs for each possible value of a column. 
This technology works well for many OLTP sys- ! 
terns, but the lists of row-IDs may be long, and j 
performing complex processing on lists of these 
lists requires a lot of CPU time. A bitmap index : 
stores informa tion much more efficiently in i 
many situations. 

But the real benefit of the bitmap Index is pro- 
cessing speed, particularly when combining 
constraints. The process of performing a logical 
Operation (AMD or Oft, for example) on « 3crice 
of bitmap indexes is very efficient, particularly 
compared with performing similar processes on 
lists of row-IDs. 



Bit vector star joins create bitmap lists of rows j 
meeting the various restrictions In a query, and 
then intersect those rows for a new bitmap (d all 
of the rows meeting all of the restrictions. This 
kind of processing is very efficient when com- 
pared to processing long lists of tdw-ID in B- j 
tree indexes. i 
Bit vector joins are one Interesting solution to 
multidimensional joins. But this approach by j 
itself docs have a number of limitations: ! 

• Bit vector star joins work well for a limited j 
class of queries; they do not provide a 
comprehensive solution for complex j 
queries- j 

• They don't address the issues Of data j 
skew. j 

• They don't adjust the degree of parallelism j 
appropriate for the query. i 

• They don't address the estimation prob- | 
lems wilh traditional cost-based optimiz- 
ers, i 

• Because this strategy uses bit vector index- j 
es. It shares the restrictions of those index- I 
es. For example, traditional bit vector ] 
indexes only work well with a small num- i 
ber of domain values. As a result, bit vec- \ 
toi target Joins only work well on those i 
kinds of columns. 

Although they represent one method for han- j 
dling complex query processing, bit vector star j 
joins are by no means a comprehensive solution 
to the complex queries tha t data warehousing 
generates. 
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At this point, you may be thinking that nil is 
lost, thatRDBMSs are poor candidates for data 
warehouses, and that other non-KDBMS solu- 
tions must be pursued. Fortunately, lite relation- 
al model is well suited to data ■warehousing. The 
standardization of SQL and the wealth of avail- 
able tools and expertise reinforce the Idea that a 
RDBMS is the natural choice for the data reposi- 
tory of the data warehouse. 

What is needed is a new way to efficiently 
process complex queries against data warehouse 
databases; a way that doesn't suffer from the 
many problems previously discussed. The com- 
plex query processing required in data ware- 
housing environments presupposes a compre- 
hensive and flexible approach to star schema 
processing. Bed Brick's RDBMS, by virtue of 
having been desired from its core to deal with 
data warehousing environments, is uniquely 
positioned to deliver such a solution, lied Brick 
has implemented a a combination of unique, 
high-performance indexes, advanced optimisa- 
tion techniques and sophisticated join process- 
ing algorithms to address the problem of multi- 
table joinS- 

Red Brick's singular focus on data warehousing 
delivers a star schema solution unsurpassed in 
the range and sophistication of technology it 
offers for multidimensional joins. These include: 

• Red Brick's unique STARjoin""* processing 

• TARGETjoin™ technology as a comple- 
mentary star schema processing technique 

• Dynamic incremental optimization™ — a 
radically different cost-based optimizer 

that ^elects Che best Strategy for each query 

• TARGETindex™ —advanced bitmap index 
technology 

• Parallelism-on-demand™ — intelligunt 
parallelism for optima) perfomnance 
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By providing multiple, high-performance alter- 
natives for multidimensional joins and an 
advanced optimizer for selecting among them, 
Red Brick Warehouse solves a wide range of 
join problems with optimal performance. This 
range of star schema processing solutions gives 
Red Brick the flexibility to handle real-world 
performance issues— problems Such as unex- 
pected queries, widely-varying user loads, and 
skewed data, 

It's not enough lo provide only one or maybe 
two of the components of the Red Brick solu- 
tion. A comprehensive and v\ 



tion is required to provide the flexibility and 
performance required in the data warehouse 
environment. 

Let's look individually at each component of 
Red Brick's star schema processing solution, 

Res Buck's STARjoiN 

Red Brick's STARjoln is a high-speed, single- 
pass, parallelizable multi-table Join, Red Brick's 
unique STARjoin technique overcomes the prob- 
lems that plague traditional OUTP database 
products. Moreover, even when joining only 
two tables, STARjoin is unlike any join method 
implemented by traditional OLTP databases. 
These differences translate into unequaled join- 



In understanding the advantage of STARjoin, it 
is useful to compare it to another time-honored 
technique for improving query performance: 
indexing. The use of indexes to accelerate query 
processing has long been a standard capability 
incorporated Into all RDBMS products, When 
indexes are defined on selected columns of a 
table onrj the query constrains on those 
columns, the RDBMS can use the index to very 
quickly identify the rows of interest. In a similar 
way, Red Brick's RDBMS supports the creation 
of specialized indexes, called STARindex™, to 
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dramatically accelerate join performance. 

Bed Brick's RDBMS is the only RDBMS to sup- | 

port these specialised indexes for join process- 1 

ing. The STARindex is unlike any Index struc- \ 

hire present in any other RDBMS. In other j 

words, you cannot gain the benefits that Red ] 

Brick provides with a STARindex by using any ( 

combination of traditional B-tree or bitmap \ 

indexes, j 

STARindexes are created on One or more foreign | 
key columns of a fact tabic. Unlike traditional '. 
indexes that contain information lo translate a 
column value to a list of rows with that value, a 
STARindex contains highly compressed infor- 
mation that relates the dimensions of a fact table 
to the rows that contain those dimensions, 
Because STARindexes are so space-efficient, 
they can be built and maintained very rapidly. 
The STARjoin algorithm can make use of the 
STARindex to efficiently Identify all the rows 
required for a particular join. 

To better understand how STARjoins can use 
STARindexes so effectively, it is useful to con- 
trast STARindexes with the composite key 
indexes often supported by traditional OLTP 
databases. These composite indexes consist of 
two or more column values within the table 
being indexed. When these indexes are defined 
and a query references these column values, the 
index allows very rapid selection of target rows. 
Similarly, the presence of a STARindex allows 
Red Brick's RDBMS to rapidly identify which 
target rows of the fact table are of interest for a 
particular set of dimensions, Also, because 
STARindexes are created over foreign keys, no I 
assumptions are made about the type of queries | 
which can use the STARindex, In ether words, 
STARindexes do not limit the kind of queries or ! 
types of joins that can be done, but accelerate 1 
queries that join related tables, \ 



There are few similarities between STARindexes 
and traditional multi-column indexes. One key 
difference is that multi-column indexes refer- 
ence a single table whereas the STARindex la 
the only index to reference multiple tablea- 
Another key difference is that, with multi-col- 
umn indexes, if a query does not constrain on 
all the columns in the composite index, the 
index cannot be fully used unless the specified 
columns are a leading subset. For example, 
Index(A,B,C) could be used if A+B+C are con- 
strained, if A+B arc constrained, or if just A is 
constrained, but not if A+C is constrained. The 
analogous situation for a STARindex would be a 
query that docs not reference all the tables 
defined in the STARindex. Fortunately, unlike 
multi-column indexes, Red Brick's RDBMS can 
use STARindexes independent of the combina- 
tion or order of constraints. Hence it lends itself 
to a variety of analyses resulting In different 
patterns of constraint processing. 
Consider again the traditional OLTP database 
approach of generating a full Cartesian product 
of dimension tables, which then can be joined to 
the fact table. Of course, this doesn't work well 
for even moderately-sized dimension tables 
because of the size of the Cartesian product 
However, what if you could get the benefits of 
efficiently joining the dimension tables to the 
fact toble without the penalty of generating the 
full Cartesian product? The result would be the 
best of all possible worlds with performance far 
exceeding any of the traditional OLTP database 
approaches. The result would be a STARjoin, 

STARjoin can work this seemingly impossible 
feat because STARindexes allow STARjoin to 
quickly identify which regions of the Cartesian 
product space contain rows of interest (in much 
the same way that a B-tree index can quickly 
identify which rows contain column values of 
Interest). The STARjoin algorithm generates a 
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Cartesian product in regions where there art 
cows of interest and bypasses generating 
Cartesian products over regions where there are 
no rows. As a simple example, if the COMPANY 
dimension table contains 100 rows, a simple 
Cartesian product would generate rows for each 
of the 100 companies. However, if your fact 
table only contains data for two of these compa- 
nies for a particular set of other dimension val- 
ues, you would be unnecessarily generating 98 
combinations- STARjoin would know, by virtue 
of STARindex, that only two companies were 
involved, and it would only generate C 
products for those two companies. 



Figure*, STARJndex 
COMPANY 



ITEM 



DATE SHIPJTO 



A STARindt* un PURCHASES could contain any combi- 
nation of the foreign key rieltki. A dingle STAUindex on 
PURCHASES that contains (COMPANY. tt&M, DATE, 
SI1[?_FROM,SHie„nO) wputd be sufficient to process 
any query against thin star schoma. Howcvor, a <juofy 



TARGETJOIN 

Another key component of Red Brick's STAR 
schema processing is TARGETjoin technology. 
TARGETjoin takes the simple bit vector star join 
technology to a whole new level of sophistica- 
tion, a. Level that is essential for real-world data 
warehouse processing. 
TARGETjoin relies on Red Brick's 
TARGETindexes, a high performance, law- 
maintenance bit vectored index technology 
already thoroughly integrated in the Red Brick 
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RDBMS- TARGETjoin applies sophisticated 
multi-table join processing to solve a variety of 
query types. Among its benefits are; 

• TARGETjoin is built on Red Brick's 
TARGETindexes; these hove many advantages 
over simple bit-map indexes, including: 

» adaptive indexing technology which allows 
the indexes to conform optimally to any 
data skew 

• a family of Indexes that deal with the full 
spectrum of data cardinalities 

• very low overhead in terms of storage Space 
Or update time on loads 

» run-time reordering and early-exit opti- 
mization for truly phenomenal performance 

» TARGETjbins require no special tuning u P ^ 
front except creating TARGETindexes on for- 
eign keys in the fact table 

• TARGETjoins are flexible in handling a wide 
variety of queries 

The TARGETjoin technology assumes the exis- 
tence of a TARGETindex on the relevant foreign 
keys in the fact table, A TARGETjoin uses the 
dimension tables ajid the TARGETindexes to cre- 
ate lists of rows for each restriction in the query, 
and then intersects those lists to create a list of 
the rows to retrieve from the fact table. 
Let's step through this with an example. A query 
asks for the list of purchases over $2000 in the 
month of May by the sales or marketing deport- 
ments. The query restricts the customer, item, 
and date dimensions, and references the fact 
table, purchases. 

Assuming that there are TARGETindexes on the 
item key, customer key and date key in the PUR- 
CHASES table, a TARGETjoin would do the fol- 
lowing; 

■ From the ITEMS table, find items over 
$2000 

» Join this with the TARGETinde* on the item 
key in the purchases table la generate a 



BOW* 1* 

HSC0046973 

PAGE 12116 * RCVD AT 311712009 3:11:16 PM [Eastern Daylight Time] * SVR:USPTO-EFXRF-6/41 * DNIS:2733905 * CSID: * DURATION (mm-ss):05-56 



Mar-1 7-Q9 12:08pm F rom-WOODCOCK WASHBURN LLP 



T-871 P. 013/016 F-976 



list of rows matching this restriction 

■ Perform the first two Steps for the other 
restrictions in the query 

■ Intersect the resulting lists; rows that 
occurred on all lists satisfy all require- 
ments and should be retrieved from the 
sales table 

• Join this list of rows with the relevant rows 
from thv dimension tables for the complete 
result 

The beauty of the solution is its flexibility; its 
only requirement is TARGETindexes on the for- 
eign keys of the fact table. Intersecting the 
bitmap lists is an efficient operation. By only 
retrieving the relevant rows from the fact table, 
the TARGETjoin reduces I/O to the fact tabic 
significantly. 

REP BUCK'S ParalhioH -DEMAND™ 
Parallel processing can provide a real boost to 
complex query performance if the process sup- 
ports parallelism and the optimizer applies it 
appropriately. Parallel processing may introduce 
a detrimental overhead on simpler queries or 
queries that cannot easily be broken into paral- 
lel processes. Red Brick's "parallel-on-demand" 
technology analyzes queries for the optimal 
degree of parallelism. Complex queries may 
have more parallel processing, while simpler 
queries require less. This state-of-the-art parallel 
processing technology is well-integrated in the 
Red Brick offering. 

Should the need arise, die Red Brick database 
has Uie ability to exploit virtually unlimited par- 
allelism. For example, the STARjoin is inherent- 
ly parallel-izablc. The work may be Split into 
multiple, independent processes with a limited 
need 10 coordinate with each Other. Each 
process is given its own piece af the "join 
spate" to work on, which can be divided into an 
arbitrary number of pieces. The result is any 
number of parallel work assignments with no 
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[ need to coordinate— the perfect recipe for scale- 
! up. 

| Red Brick's parallel-on-demand technology is in 
| sharp contrast to OLTP systems which cannot 
! dynamically set degrees of parallelism, For 
! example, consider two nearly identical queries. 
The first selects all purchases of toy trucks in 
the month of May; It joins in PATES and FROD- 
UCTS tables to flic SALES tabic. The second 
i query is identical to the first except that it asks 
i for all toy purchases in the month of May. It also 
; joins the dates and products tables to the 
', SALES table. 

; The Red Brick analyzer will automatically 
; invoke more parallelism to the second query, 
becauso it is selecting all products of the catego- 
ry toy, and leas parallelism lo the first query, 
I which is only looking for toy trucks. Most tradi- 
tional RDBMSs would require that you select 
\ beforehand the appropriate degree of para!- 
! lelism for each query- In effect, this means that a 
usermu5t tune the degree of parallelism on a 
query-by-query basis for optimal performance. 
Red Brick's intelligent parallel-on-demand goes 
even further by looking not only at the query 
but at the current system resource usage. In 
cases where the system is largely idle, it will 
increase parallelism if it provides a benefit, but 
if the system is very busy, it will use less paral- 
lelism, This prevents a large, complex query 
from shutting down the system lo other queries, 
and also enables it to get the best performance 
possible when system usage is low. By consider- 
ing system resources In addition to the query's 
requirements, Red Brick's parallcl-on-dejnand 
technology ensures the best performance for the 
Overall system, in addition to Individual queries. 

DYNAMIC iNCREMEHrAl QUESY QCTlMlUtlOH 

As described before, traditional cost-based opti- 
mization doesn't work well for complex 
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schemas, Red Brick has chosen a very different 
approach to this problem than traditional OLTP 
database vendors, an approach much better 
suited to the complex queries of the data ware- 
housing environment. 

Red Brick's database is the only RDBM5 to sup- 
port an incremental optimization model. The 
basic idea is to determine which sections of tine 
query can be reasonably estimated, and then 
run. The actual results of this step (not esti- 
mates) are used to estimate the next step, elimi- 
nating the problem of propagating estimation 
errors, Beuer yet, a plan can change in mid- 
query if the actual data is radically different 
than statistics would have led you to expect As 
an example, Red Bride's RDBMS uses this tech- 
nique to select the appropriate STARindex only 
after it has became clear how many rows there 
are in the dimension tables to be joined. The 
result is far belter index selection than could be 
achieved with any approach that relies on esti- 



The breadth of Red Brick's STAR schema pro- 
cessing technology provides unmatched Join 
performance in a variety of situations. 
STASindexes provide unsurpassed performance 
for targeted queries. The TARGETjam technolo- 
gy provides a good fallback solution for situa- 
tions in which there are no STARindexes or the 
indexes don't match the query. 



To better ui 



d the significant perfor- 



mance benefit of Red Brick's STAR schema pro- 
cessing, consider this example. Assume that 
there are 500 possible ITEMS, 200 companies, 
300 dates, and one million purchases in our 
database. Further, assume that a particular 
query selects SO ITEMS, 20 COMPANIES, and 30 
DATES that ultimately will Select 1,000 of the 
PURCHASES records. (For simplicity, assume 



uniform distribution of the data— no data skew.) 
A PAIB-WlSt Join 

In this example, the traditional pair-wise join 

might look something like this: 

join PURCHASES + ITEM, 

join (PURCHASES+lTEM) + DATE, 

join (EURCHASES+ITEM+OATE) + COMPANY 

The first join would generate approximately 
100,000 rows, the second would generate 
approximately 10,000 rows, and the final join 
would generate 1,000 rows. So the total number 
of rows generated to process this query is 
111,000. 

A CABIfSIAN JOIN 

The Cartesian product approach would generate 
the full Cartesian product of 50 ITEMS, 20 COM- 
PANIES, and 30 DATES = 50 x 20 x 30 = 30,000 
rows. This Cartesian product could then be 
directly joined to the purchases table to pro- 
duce the final 1,000 rows at a lota! cost of pro- 
cessing 30,000 +■ 1,000 = 31,000 rows, 
TARGETjoih 

Although the TARGETlndex joins generate the 
same number of rows as pair^ise processing 
(111,000), bit vector processing is typically an 
order of magnitude faster than actually joining 
the rows themselves, because you're dealing 
with easily-manipulated bit strings instead of 
lists of row-IDs. So the TARGETjoin would be 
. comparable to a join method that generates 
11,000 rows, 
STARjdih 

A well-constrained STARjoin would actually 
generate only slightly more combinations than 
exist in this selected rows of the PURCHASES 
tabic— on the order of 10 percent more for a 
total cost of 1,000 f (1,000 x 10%) = 1,100 rows. 
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Of course, for various reasons, using row counts 
is a very Crude way of coating out the queries, 
but it does provide an idea of the relative costs 
of the approaches. In this example, the 
Cartesian product approach is roughly 3,5 times 
better than the simple pair-wise approach, but 
STARjoln la 28 times more efficient than even 
this approach, and a whopping 101 times more 
efficient than the simple approachl 
COMPLiX SCHIMAS 

So far our examples have been based on a rela- 
tively simple star schema. Real-world data 
warehousing schcrrtas are often more complex. 
Red Brick's family af technologies handles com- 
plex schemns as well as simple schemas. 

As an example, consider a schema with multiple 
fact tables, with one or mare common dimen- 
sions. Often the queries processed against 
schemHs like this relate not only the fact tables 
to their associated dimensions but also the fact 
tables .to each other. STARindexes are very well 
suited to this task, in particular, joining multiple 
fact tables can by done by selectively accessing 
the STARindexes of the tables being Joined. The 
end result is a complex join being processed at 
speeds simply not possible with traditional 
OLTF databases. In fact these complex schemas 
are even poorer fits for the traditional OLTP 
databases because they have effectively doubled 
all the problems Df the single-fact-table schema. 

Figure 5 illustrates a multi-fact-table star 
schema. 



Figure 5. Multi-Fact-Table Star Schema 
COMPANY SHlP_p«OM STORE 

\ / I ^ DATE 

PURCHASES SALES^ 

/ / \/ \ 
DATE SHIPJO ITEM PROMOTION 



The PURCHASES fact table tracks purchases of 
the items, and the sales fact tabic tracks sales 
Of the same items. Each fact table has dimen- 
sions which are appropriate for the analysis of 
those facts. The two fact tables share a common 
dimension, ITEM, which enables the analyst to 
relate the two fact tables and ask questions such 
as, "How do purchases of a particular set of 
items compare to sales of those same items?" 
This complex schema could present insur- 
mountable hurdles for a traditional RDBMS, but 
will perform exceptionally well with Red Brick's 
STARjoin technology. 



The real power of the Red Brick family of tech- 
nologies is integration and the availability Df 
both join procedures with advanced optimiza- 
tion and Intelligent parallelism. Well-defined 
STARindexes provide Optimal performance for 
critical queries, while TARCETmdexes on the 
foreign keys enable TARGETjoins on a wide 
variety of queries. Red Brick's dynamic incre- 
mental optimizer automatically selects the best 
algorithm for each query. Red Brick's intelligent 
parallelism-on-demand provides optimal per- 
formance in a parallel processing environment 
and uses the hardware most efficiently for each 
query. 
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SUMMARY 

Data warehouses requite different schemas than 
those used by OLTP systems; data warehouses 
need to provide a fast response to ad hoc, often 
complex queries on large amounts of data. 

Data warehouses need to support incessant, 
intensive ad hoc queries; the star schema is very 
well suited to this kind of demand, OUT data- 
bases are inherently ill-suited to the schemas and 
processing required by data warehousing; 
attempts by OLTP vendors to address their per~ 
formance problems fall short of the comprehen- 
sive and flexible solution that a robust data ware- 
house requires. 

Red Brick's RDBMS, with Industry-leading 
breadth and depth of solutions for star schema 
join processing, is an idea! choice for dataware- 
housing solutions. Its abilities to handle complex 
multi-faet-table schemas, assign appropriate 
degrees of parallelism, and dynamically choose 
STARindexes or other star schema processing 
based on the real characteristics of the data are 
unique in the data warehousing industry. 
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ABOUT BED BRICK SYSTEMS, INC. 

Red Brick Systems, Inc., based in Los Gatos, 
Calif., is a leading provider of higtvpeiformance 
relational database products for data warehous- 
ing. Its flagship product. Red Brick Warehouse 
5.0, is the world's fastest and most scalable rela- 
tional database for data warehousing, including 
data marts, online analytical processing (OLAP), 
and data mining. Red Brick customers are Suc- 
cessful because the company's high-performance 
products and service enable more users to ana- 
lyze more data and make better decisions faster. 
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